Representing Text Chunks

نویسندگان

  • Erik F. Tjong Kim Sang
  • Jorn Veenstra
چکیده

Dividing sentences in chunks of words is a useful preprocessing step for parsing, information extraction and information retrieval. (Ramshaw and Marcus, 1995) have introduced a "convenient" data representation for chunking by converting it to a tagging task. In this paper we will examine seven di erent data representations for the problem of recognizing noun phrase chunks. We will show that the the data representation choice has a minor inuence on chunking performance. However, equipped with the most suitable data representation, our memory-based learning chunker was able to improve the best published chunking results for a standard data set.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Using Clustering to Identify Outlier Chunks of Text - Notebook for PAN at CLEF 2011

Intrinsic plagiarism detection is a sub-task of authorship identification in which outlier chunks must be detected solely on the basis of stylistic differences from the main body of the text. We present a first attempt at utilizing words that appear infrequently in a text as stylistic markers for distinguishing outlier chunks in the text. In the first phase of our method we cluster chunks of te...

متن کامل

An Approach to Text Summarization.

We propose an efficient text summarization technique that involves two basic operations. The first operation involves finding coherent chunks in the document and the second operation involves ranking the text in the individual coherent chunks and picking the sentences that rank above a given threshold. The coherent chunks are formed by exploiting the lexical relationship between adjacent senten...

متن کامل

Partial Parsing of Spontaneous Spoken French

This paper describes the process and the resources used to automatically annotate a French corpus of spontaneous speech transcriptions in super-chunks. Super-chunks are enhanced chunks that can contain lexical multiword units. This partial parsing is based on a preprocessing stage of the spoken data that consists in reformatting and tagging utterances that break the syntactic structure of the t...

متن کامل

SemAligner: A Method and Tool for Aligning Chunks with Semantic Relation Types and Semantic Similarity Scores

This paper introduces a ruled-based method and software tool, called SemAligner, for aligning chunks across texts in a given pair of short English texts. The tool, based on the top performing method at the Interpretable Short Text Similarity shared task at SemEval 2015, where it was used with human annotated (gold) chunks, can now additionally process plain text-pairs using two powerful chunker...

متن کامل

Studies for Segmentation of Historical Texts: Sentences or Chunks?

We present some experiments on text segmentation for German texts aimed at developing a method of segmenting historical texts. Since such texts have no (consistent) punctuation, we use a machine learning approach to label tokens with their relative positions in text segments using Conditional Random Fields. We compare the performance of this approach on the task of segmenting of text into sente...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1999